I’m going to work through this lab as I update it. The original lab is mostly good, but it teaches a few skills that are less useful or appropriate than they could be, so I’m going to replace some components with ideas of my own. As I do this, I want to make sure everything works.
First, I need to read in the data from my working directory.
Here are the packages I need:
library(tidyverse)
library(modelr)
And here’s the data:
dt <- read_csv("Columbus_2020_listings.csv")
There are a bunch of variables in here:
dim(dt)
## [1] 1409 106
Okay, time for some brain work. I’m telling students to narrow the data down to 5-7 variables that they think will be most relevant for predicting housing price, so I should do the same. There’s a lot in here, and many more than 7 variables are probably predictive. Here are some promising categories:
Host details
Neighborhood details
Property details
Pricing
Satisfaction
Misc
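With 106 columns, a quick way to build a shortlist under each of these themes is to scan column names by keyword. A minimal sketch of the pattern, run here on a toy stand-in rather than the full data (the toy values are made up; only the column names mirror the real file):

```r
library(tidyverse)

## toy stand-in for the full listings data (`dt` has 106 columns)
toy <- tibble(
  host_is_superhost = c(TRUE, FALSE),
  host_acceptance_rate = c("96%", "100%"),
  neighbourhood_cleansed = c("Near North/University", "Clintonville"),
  price = c("$198", "$47")
)

## scan column names by theme to build a shortlist of predictors
toy %>% select(contains("host")) %>% names()
```

On the real data, the same call against dt (or contains("neighbourhood"), contains("review"), etc.) pulls up each theme’s candidates.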
Okay, so I have my picks. Now I need to do some cleaning and refining:
dt %>%
transmute(
## the outcome(s)
across(c(price,
security_deposit,
cleaning_fee),
~ str_remove(.x, "\\$") %>%
as.numeric()),
## the predictors
host_acceptance_rate =
str_remove(host_acceptance_rate, "\\%") %>%
as.numeric(),
host_is_superhost = host_is_superhost+0,
neighbourhood_cleansed,
room_type,
number_of_reviews,
review_scores_rating,
cancellation_policy,
## keep location details for mapping
longitude,
latitude
) -> dtclean
glimpse(dtclean)
## Rows: 1,409
## Columns: 12
## $ price <dbl> 198, 47, 80, 90, 30, 60, 75, 199, 235, 75, 32, …
## $ security_deposit <dbl> 350, 0, 0, 125, 0, 0, 0, 0, 200, 0, 0, NA, NA, …
## $ cleaning_fee <dbl> 70, 20, 25, 75, 0, 0, 5, 125, 75, 5, 0, NA, NA,…
## $ host_acceptance_rate <dbl> 96, 100, 100, 87, 98, 98, 100, 100, 54, 100, 98…
## $ host_is_superhost <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ neighbourhood_cleansed <chr> "Near North/University", "Near North/University…
## $ room_type <chr> "Entire home/apt", "Private room", "Private roo…
## $ number_of_reviews <dbl> 378, 97, 220, 131, 218, 44, 165, 131, 64, 131, …
## $ review_scores_rating <dbl> 96, 93, 98, 98, 95, 94, 99, 100, 97, 99, 97, 90…
## $ cancellation_policy <chr> "strict_14_with_grace_period", "moderate", "mod…
## $ longitude <dbl> -83.00321, -83.00986, -82.97968, -83.00067, -83…
## $ latitude <dbl> 39.98394, 40.01243, 39.96086, 40.01898, 40.0125…
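One caveat on the price cleaning above: str_remove(.x, "\\$") followed by as.numeric() returns NA for any price formatted with a thousands comma. I don’t know whether the Columbus file actually contains comma-formatted prices (the glimpse suggests everything parsed fine), but readr::parse_number() is the safer habit. A small demonstration:

```r
library(tidyverse)

prices <- c("$198.00", "$1,250.00")

## str_remove + as.numeric chokes on the comma
suppressWarnings(as.numeric(str_remove(prices, "\\$")))  # 198 NA

## parse_number strips both the "$" and the comma
parse_number(prices)  # 198 1250
```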
The room type and cancellation policy columns may make more sense as ordered categories:
## how many unique categories
table(dtclean$room_type) # 4
##
## Entire home/apt Hotel room Private room Shared room
## 1043 3 346 17
table(dtclean$cancellation_policy) # 4 (but really 3)
##
## flexible moderate
## 433 508
## strict_14_with_grace_period super_strict_30
## 467 1
## okay let's update the order
library(socsci)
dtclean %>%
mutate(
room_type = frcode(
room_type == "Shared room" ~ "Shared room",
room_type %in%
c("Hotel room", "Private room") ~ "Hotel/Private room",
room_type == "Entire home/apt" ~ "Entire home/apt"
),
cancellation_policy = frcode(
cancellation_policy == "flexible" ~ "Flexible",
cancellation_policy == "moderate" ~ "Moderate",
TRUE ~ "Strict/Super strict"
)
) -> dtclean
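It’s worth confirming the levels came out in the intended order. frcode() comes from {socsci} and assigns levels in the order the cases are written; a stand-alone sketch of the same idea with base factor() (toy values, not the lab’s data):

```r
## base-R equivalent of what frcode() produces: a factor whose
## levels follow the stated order, not alphabetical order
rt <- factor(
  c("Entire home/apt", "Shared room", "Hotel/Private room"),
  levels = c("Shared room", "Hotel/Private room", "Entire home/apt")
)
levels(rt)  # shared, then hotel/private, then entire home
```

On the real data, levels(dtclean$room_type) should print the same ordering.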
I need to make two polished visuals showing the relationship between some predictors and price.
Visual 1:
coolorrr::set_theme()
dtclean %>%
group_by(room_type) %>%
mean_ci(log(price)) %>%
ggplot() +
aes(x = exp(mean),
xmin = exp(lower),
xmax = exp(upper),
y = room_type,
label = paste0("N = ",
scales::comma(n))) +
geom_pointrange(
aes(size = n),
show.legend = F
) +
scale_size_continuous(
range = c(0.1, 1)
) +
geom_text(vjust = 1.75) +
scale_x_continuous(
labels = scales::dollar
) +
labs(
x = "Avg. Housing Price",
y = NULL,
title = "AirBnB Price by Kind of Housing"
)
Visual 2:
ggplot(dtclean %>%
filter(host_acceptance_rate>0)) +
aes(x = host_acceptance_rate,
y = price) +
geom_jitter(
aes(y = 40),
color = "darkgray",
alpha = 0.25,
height = 5,
width = 1
) +
scale_y_continuous(
breaks = seq(40, 150, by = 10),
labels = scales::dollar
) +
stat_smooth(
method = "glm",
method.args = list(family = poisson),
formula = y ~ x,
color = "darkblue"
) +
labs(
x = "Host Acceptance Rate (%)",
y = "Price of Housing",
title = "Higher acceptance = higher price",
subtitle = "Poisson line of best fit"
)
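To put a number on what that Poisson line implies: exponentiating the slope gives the multiplicative change in expected price per one-point gain in acceptance rate. A sketch on simulated data (not the listings), just to show the interpretation:

```r
## simulate prices with a log-linear mean, then recover the slope
set.seed(42)
acceptance <- runif(500, 50, 100)
price <- rpois(500, lambda = exp(3 + 0.01 * acceptance))

fit <- glm(price ~ acceptance, family = poisson)

## exp(slope) = multiplicative change in expected price per 1-point
## increase in acceptance rate (here, about a 1% bump per point)
exp(coef(fit)["acceptance"])
```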
Now I need to make a map of Columbus to summarize housing locations.
First I need to install {ggmap} by running devtools::install_github("dkahle/ggmap"). Then I can open it:
library(ggmap)
Now I need the coordinates for Columbus:
## coordinates
map <- c(left=-83.2, bottom=39.8, right=-82.75, top=40.16)
## state map
cbus_map <- get_stamenmap(map, zoom = 10, maptype = "toner")
Now let’s check it out:
ggmap(cbus_map) +
theme_void()
That’s cool. So I should be able to add the coordinates of different AirBnBs by writing:
ggmap(cbus_map) +
geom_point(
data = dtclean,
aes(x = longitude,
y = latitude,
size = price),
alpha = 0.05,
color = "darkblue",
show.legend = F
) +
labs(title = "Location of AirBnBs in Columbus",
subtitle = "Larger points indicate higher prices") +
theme(
axis.line = element_blank(),
axis.text = element_blank(),
axis.title = element_blank()
)
It looks like it’s expensive to stay close to downtown. It may be worth adding a column that’s distance to the city center:
library(geosphere)
## lon = -83 : lat = 39.96
ccenter <- c(-83, 39.96)
dtclean$distance <- with(dtclean, distHaversine(ccenter, cbind(longitude, latitude))) / 1000
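A quick sanity check on that distance column before trusting it: the center should be 0 km from itself, and 0.01 degrees of latitude is roughly 1.1 km (this assumes {geosphere} is installed, as above):

```r
library(geosphere)

ccenter <- c(-83, 39.96)

## center to itself: exactly 0 meters
distHaversine(ccenter, ccenter)

## 0.01 degrees of latitude: about 1.11 km
distHaversine(ccenter, ccenter + c(0, 0.01)) / 1000
```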
ggplot(dtclean) +
aes(
x = distance,
y = price
) +
geom_point() +
scale_y_continuous(
labels = scales::dollar
) +
#scale_x_log10() +
labs(
x = "Distance from city center (km)",
y = "Price",
title = "It pays to have a location downtown"
)
We can fit a simple bivariate regression modeling price as a function of distance from the city center:
simple_fit <- lm(price ~ distance, dtclean)
summary(simple_fit)
##
## Call:
## lm(formula = price ~ distance, data = dtclean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.22 -78.76 -45.29 14.68 866.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.4357 6.3730 27.685 <2e-16 ***
## distance -8.1047 0.9453 -8.574 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 155.4 on 1369 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.05096, Adjusted R-squared: 0.05027
## F-statistic: 73.52 on 1 and 1369 DF, p-value: < 2.2e-16
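The distance coefficient is easier to read as predicted prices. Plugging the estimates reported above into the fitted line by hand:

```r
## fitted line from the summary above: price = 176.44 - 8.10 * distance
b0 <- 176.4357
b1 <- -8.1047

## predicted nightly price at 0, 5, and 10 km from the city center
b0 + b1 * c(0, 5, 10)  # roughly 176, 136, and 95 dollars
```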
Let’s also glance at how some of the candidate predictors relate to each other:
library(GGally)
ggpairs(dtclean %>% select(security_deposit,
host_acceptance_rate,
host_is_superhost,
distance)) +
theme_classic()
## How about we compare 5 models (a NULL reference plus 4 candidates)?
## Step 1: use MC cross-validation
cv_dt <- dtclean %>%
crossv_mc(n = 500)
## Step 2: specify a list of model specifications
spec_list <- list(
log(price) ~ 1, # NULL model for reference
log(price) ~ room_type,
log(price) ~ room_type + distance,
log(price) ~ room_type + host_is_superhost,
log(price) ~ room_type + host_acceptance_rate
)
## Step 3: train
cv_dt %>%
expand_grid(., specs = spec_list) %>%
mutate(
model_num = rep(1:5, len = n()),
models = map2(specs, train, ~ lm(.x, .y))
) -> cv_out
## Step 4: validate performance
cv_out %>%
mutate(
rmse = map2(models, test, ~ rmse(.x, .y))
) -> cv_out
## Step 5: collect the columns I need
cv_out %>%
select(.id, model_num, rmse) %>%
  unnest(rmse) -> cv_results
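Before plotting, a table of average RMSE per model makes the winner explicit. The pattern, shown on a toy stand-in for cv_results (two fake models, three MC iterations each):

```r
library(tidyverse)

## toy stand-in for cv_results: two models, three MC iterations each
toy <- tibble(
  model_num = rep(1:2, each = 3),
  rmse = c(0.9, 1.0, 1.1, 0.7, 0.8, 0.9)
)

## average RMSE per model, best (lowest) first
toy %>%
  group_by(model_num) %>%
  summarize(avg_rmse = mean(rmse)) %>%
  arrange(avg_rmse)
```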
## Step 6: visualize the results
cv_results %>%
  group_by(model_num) %>%
  mutate(
    avg_rmse = mean(rmse)
  ) %>%
  ggplot() +
  aes(x = model_num,
      y = rmse) +
  geom_jitter(
    color = "darkgray",
    width = .1,
    alpha = 0.4
  ) +
  geom_line(
    aes(y = avg_rmse)
  ) +
  geom_point(
    aes(y = avg_rmse),
    color = "red",
    size = 2
  ) +
  scale_x_continuous(
    breaks = 1:5,
    labels = c("NULL",
               "type",
               "type + distance",
               "type + superhost",
               "type + acceptance")
  ) +
labs(x = NULL,
y = "RMSE",
title = "MC Cross-validation Results",
caption = "(500 MC iterations)")
Seems like using the housing type and distance to downtown do the best!
So we know this is the best:
modelfit <- lm(log(price) ~ room_type + distance, dtclean)
summary(modelfit)
##
## Call:
## lm(formula = log(price) ~ room_type + distance, data = dtclean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3948 -0.3906 -0.1062 0.3129 2.3008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.489685 0.155789 22.400 < 2e-16 ***
## room_typeHotel/Private room 0.580033 0.157774 3.676 0.000246 ***
## room_typeEntire home/apt 1.415828 0.155372 9.112 < 2e-16 ***
## distance -0.022971 0.004055 -5.666 1.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6347 on 1367 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.3127, Adjusted R-squared: 0.3112
## F-statistic: 207.3 on 3 and 1367 DF, p-value: < 2.2e-16
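Since the outcome is log(price), exponentiating the coefficients turns them into multiplicative effects. Using the estimates reported above:

```r
## exponentiated coefficients from the summary above
exp(c(entire_home = 1.415828, distance_km = -0.022971))
## entire home/apt: about 4.1x the price of a shared room (the reference level);
## each km from the center: about a 2.3% lower price
```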
Low/med/high price predictions:
predata <- tibble(
room_type = sort(unique(dtclean$room_type)),
distance = c(20, 10, 0)
)
preds <- paste0("$",round(exp(predict(modelfit, predata)), 2))
cbind(price = preds, predata)
## price room_type distance
## 1 $20.7 Shared room 20
## 2 $46.53 Hotel/Private room 10
## 3 $135.03 Entire home/apt 0
You would net the highest nightly rate with an AirBnB that’s an entire home/apartment close to the city center. If you can only offer a shared room in a far-out suburb, good luck making much money!